Abstract: Acyclic digraphs arise in many natural and artificial processes. Among the broader 
set, dynamic citation networks represent a substantively important form of acyclic digraphs. For 
example, the study of such networks includes the spread of ideas through academic citations, the 
spread of innovation through patent citations, and the development of precedent in common law 
systems. The specific dynamics that produce such acyclic digraphs not only differentiate them 
from other classes of graphs, but also provide guidance for the development of meaningful distance 
measures. In this article, we develop and apply our sink distance measure together with the single- 
linkage hierarchical clustering algorithm to both a two-dimensional directed preferential attachment 
model as well as empirical data drawn from the first quarter century of decisions of the United States 
Supreme Court. Despite applying the simplest combination of distance measures and clustering 
algorithms, analysis reveals that more accurate and more interpretable clusterings are produced by 
this scheme. 

Keywords: citation network, distance measure, acyclic digraph, community detection, clustering, judicial 
citations, dimensionality 



I. INTRODUCTION & MOTIVATION 

While a variety of algorithms exist for the analysis of 
undirected or cyclic graphs, e.g., social networks, com- 
paratively little work has been done on acyclic digraphs. 
Previous literature has focused particularly on the de- 
velopment of canonical random graph models or the ap- 
plication of algorithms for general graphs to this special 
class (|S], [S], [1]). While these initial papers have made 
important contributions, additional investigation of dy- 
namic acyclic digraphs (DADGs) still remains. 

Dynamic acyclic digraphs arise naturally in the con- 
text of document citation networks. In these networks, 
nodes represent documents and arcs represent the cita- 
tions from one document to another. Much of the previ- 
ous literature on citation networks, however, disregards 
the direction of these arcs (for a notable exception see 
Leicht et al. [9]). This choice results in undirected 
graphs with many cycles, and thus allows the applica- 
tion of a wide variety of well-developed algorithms. On 
the other hand, disregarding direction does discard infor- 
mation about time and the flow of dependency. 



Recent work demonstrates that applying methods for 
undirected cyclic graphs to citation networks may create 
difficulties ([!]). While it is important to identify defi- 
ciencies in existing methods, it is more helpful to develop 
alternative approaches designed to properly address these 
shortfalls. With respect to the context in question, we 
seek to develop domain-specific methods and measures 
for citation networks that take the acyclic digraph nature 
of these networks into account. In this article, we present 
a novel sink distance measure that provides better com- 
putational efficiency and qualitative accuracy than tra- 
ditional measures. 



II. PROPERTIES OF DYNAMIC CITATION 
NETWORKS 

A. Topological Ordering 

Since citation networks are dynamic acyclic digraphs, 
they feature a number of important properties that dis- 
tinguish them from other networks. The most funda- 
mental property of acyclic digraphs is that there exists 
at least one topological ordering of the nodes ([I]). Such 
a topological ordering can also be used to index the dynamic 
network $G$ itself, where $G$ is a nested set of increasing 
graphs $\{G_1, G_2, \ldots, G_{|V|}\}$. Each $G_n$ is a copy 
of $G_{n-1}$ with the addition of the $n^{\text{th}}$ document of the 
topological ordering and its corresponding arcs. From 
this growth dynamic, it is clear that the most natural 
topological ordering is actually the chronological order- 
ing of the documents. 

This ordering implies the existence of another dis- 
tinguishing property for this class of graphs. Unlike 
many other growing networks, the set of arcs with nonzero 
probability at each time step can be explicitly constrained. 
From a generative framework, this representation 
acknowledges that arcs cannot assert relationships 
with unobserved nodes at later times [T^. Formally, 
such a process evolves on a filtration and can 
sample from the set of possible arcs at time $t$ given 
by $\{(x, y) : x \notin V(G_{t-1}),\ y \in V(G_{t-1})\}$, where 
$V(G_t)$ is the set of vertices in the graph $G_t$ and $t$ is 
the index corresponding to the topological ordering [18]. 
From a statistical framework, in which only the resulting 
graph is observed, the previous statement can be written 
$T(x) < T(y) \Rightarrow P((x, y) \in A(G)) = 0$, where $T(x)$ is the 
time that node $x$ was introduced into the graph $G$. This 
asserts that certain events should not even be considered 
as possible in statistical models. 
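The constraint above can be checked mechanically on observed data. The following is a minimal sketch, assuming each document carries an integer introduction time; the names `inadmissible_arcs`, `possible_arcs`, and `intro_time` are hypothetical and not part of the original text. Arcs are written as (citing document, cited document), and an arc is flagged when the citing document does not strictly postdate the cited one, which matches the set of possible arcs defined above.

```python
from typing import Dict, Hashable, Iterable, List, Tuple

Arc = Tuple[Hashable, Hashable]  # (citing document, cited document)


def inadmissible_arcs(arcs: Iterable[Arc],
                      intro_time: Dict[Hashable, int]) -> List[Arc]:
    """Return every arc (x, y) whose citing node x does not strictly postdate y."""
    return [(x, y) for x, y in arcs if intro_time[x] <= intro_time[y]]


def possible_arcs(existing: Iterable[Hashable], new_node: Hashable) -> List[Arc]:
    """Arcs that may be sampled when new_node arrives: it may cite any existing node."""
    return [(new_node, y) for y in existing]


# Hypothetical three-document chain: b cites a, c cites b.
intro_time = {"a": 0, "b": 1, "c": 2}
arcs = [("b", "a"), ("c", "b")]
print(inadmissible_arcs(arcs, intro_time))
print(possible_arcs(["a", "b"], "c"))
```

A statistical model that assigns positive probability to any arc returned by `inadmissible_arcs` would be considering events that the growth process rules out.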



B. Sinks and Dimensionality 

A fact that follows immediately from the existence of 
a topological ordering is that there is at least one docu- 
ment that makes no citations and at least one document 
that has never been cited. Documents that contain no ci- 
tations correspond to nodes with out-degree zero and are 
called "sinks." The first node in the topological ordering 
is always a sink. Sinks represent documents with no de- 
pendencies observed. Thus, with respect to the data pro- 
vided, they mark the introduction of at least one original 
or novel idea ([TH]). Nodes that are not sinks then rely on 
one or more of the ideas provided in one or more sinks. 

Though the above conception of citation networks is 
simple and reasonable, it contradicts patterns often ob- 
served in empirical citation data. Namely, many docu- 
ments contribute novel ideas, but very few feature zero 
outbound citations. In order to confront this compli- 
cation and refine our initial conception, it is important 
to remember documents and citations exist in a high- 
dimensional space. Documents may contribute novel 
ideas in one dimension but draw support or comparison 
from other dimensions - we call these documents "weak" 
sinks, as opposed to "strong" sinks which make no cita- 
tions in any dimension. 

For a simple but concrete depiction of this problem, 
consider Figure 1 below, a hypothetical subgraph containing 
nodes a, b, and c. Node a is a 
"strong" sink as it features no outbound citations. Node 
b is a "weak" sink as it cites a on the red dimension but 
generates no citations with respect to its blue dimension. 
Node c is not a sink, as it relies on b and does not contain 
the red dimension. 

Lacking appropriately granular data, it is often diffi- 
cult for researchers to separate the dimensions contained 
within the observable outputs of a given system. How- 
ever, the above example highlights the specific usefulness 
of dimensional data. It is important to note that dimen- 
sional data is only necessary to identify "weak" sinks but 
not "strong" sinks. For example, if dimensional data were 
removed from Figure 1 below, thereby removing the col- 
oring of the subgraph, only node a would be identified as 
a sink. [20] 




FIG. 1: Dimensional Data and "Weak" vs "Strong" sinks 

In the context of acyclic digraphs, consider a citation 
network comprised of linkages between academic articles. 
While citations to a given article often converge upon a 
particular dimension or aspect of the work, a given article 
could be cited on the basis of any of its n dimensions. 
Building from the example offered in Figure 1, assume 
node b is an article containing both a novel method 
and an interesting substantive result [21]. If the blue 
dimension represents the article's substantive topic while 
the red dimension represents the article's methodological 
contribution, then the basis upon which node c cites node 
b and node b cites node a could only be definitively revealed 
with dimensional data. 

While this example is trivial, it reveals a broader prop- 
erty of acyclic graphs. While an author can select any 
citation from the existing citation set, that author has 
no specific control over the basis upon which his or her 
work is subsequently cited. One positive feature of the 
sink method we offer herein is that it generally preserves 
the choices made by the author at the time of 
authorship. 



III. DISTANCE MEASURES 

Distance measures between nodes are a prerequisite for 
a wide variety of algorithms in machine learning. As 
noted earlier, we believe that distance measures employed 
should incorporate the properties of dynamic citation 
networks that differentiate them from other classes of 
graphs. We consider the distance between nodes in the 
"citation" space, where all documents must orient themselves 
relative to one or more sinks of information. 

An appropriate distance measure should decrease as 
two nodes share more information. The simplest such 
measure should consider the number of shared sinks between 
two nodes. Given a node $i$ and its set of ancestors 
$A_i$, the sinks of $i$ are given by the set 
$S_i = \{x : \delta^+(x) = 0,\ x \in A_i\} = S \cap A_i$. 
Here, $\delta^+(x)$ is the standard notation 
for the out-degree of node $x$ and $S$ is the set of all sinks 
of the graph $G$. 

Using this notation, we can represent the distance between 
nodes $i$ and $j$ as the proportion of sinks they do 
not share: 

$$D_{i,j} = 1 - \frac{|S_i \cap S_j|}{|S_i \cup S_j|} \qquad (1)$$

where $|x|$ is the cardinality of set $x$. Though this distance 
measure is linearly decreasing in the proportion 
shared, one can formulate a distance measure from any 
appropriately decreasing function. The remainder of our 
distance measures will feature this linear form, but the 
reader should keep in mind that this is only exemplary. 

Furthermore, this measure can be calculated quickly 
for all pairs of nodes, as its implementation involves little 
more than graph traversal and set operations. 
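The traversal and set operations mentioned above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names `sink_sets` and `sink_distance` are hypothetical, the graph is assumed to be given as a mapping from each document to the documents it cites, and a sink is assumed to count as its own sink so that every node has a non-empty sink set.

```python
from typing import Dict, FrozenSet, Hashable, List

Graph = Dict[Hashable, List[Hashable]]  # document -> documents it cites


def sink_sets(cites: Graph) -> Dict[Hashable, FrozenSet[Hashable]]:
    """Map each node to the set of sinks reachable by following citations."""
    memo: Dict[Hashable, FrozenSet[Hashable]] = {}
    for root in cites:
        stack = [root]
        while stack:  # iterative post-order walk; terminates because the graph is acyclic
            node = stack[-1]
            if node in memo:
                stack.pop()
                continue
            targets = cites.get(node, [])
            pending = [t for t in targets if t not in memo]
            if pending:
                stack.extend(pending)
                continue
            if targets:
                memo[node] = frozenset().union(*(memo[t] for t in targets))
            else:
                memo[node] = frozenset([node])  # assumption: a sink is its own sink
            stack.pop()
    return memo


def sink_distance(s_i: FrozenSet[Hashable], s_j: FrozenSet[Hashable]) -> float:
    """One minus the share of sinks common to both nodes."""
    union = s_i | s_j
    if not union:
        raise ValueError("both sink sets are empty")
    return 1.0 - len(s_i & s_j) / len(union)


# Hypothetical example: a and b are sinks; c cites a; d cites a and b; e cites d.
cites = {"a": [], "b": [], "c": ["a"], "d": ["a", "b"], "e": ["d"]}
S = sink_sets(cites)
print(sink_distance(S["c"], S["d"]))
```

Because each sink set is computed once and reused, the pairwise distances afterwards require only set intersections and unions.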

In the above distance measure, we weight all sinks 
equally. An alternative measure might weight the importance 
of each sink by the number of unique ancestors 
shared between nodes $i$ and $j$ that are descended from 
a sink $s$ of interest. This set of $s$-ancestors of node $i$ is 
given by $A_{i,s} = \{x : s \in S_x,\ x \in A_i\} = A_i \cap D_s$, where $D_s$ 
is the set of descendants of $s$. This can be interpreted as 
the ancestors of $i$ who carry the information from sink 
$s$. If even more detail is desired, we might modify the 
above measure to also incorporate the set of paths from 
nodes $i$ and $j$ to the sinks of interest. We let $P_{i,s}$ be the 
set of path tuples $(x_1, \ldots, x_n)$ from $s$ to $i$. The resulting 
general equation takes the form 

$$D_{i,j} = 1 - \frac{\sum_{s \in S_i \cap S_j} f(A_{i,s}, P_{i,s}, A_{j,s}, P_{j,s})}{\sum_{s \in S_i \cup S_j} f(A_{i,s}, P_{i,s}, A_{j,s}, P_{j,s})}$$

Straightforward choices of / involve the cardinality of 
these sets, but some care must again be taken if one de- 
sires a distance metric that obeys all axioms. If needed, 
these functions may take on much more complexity. For 
instance, the importance of a sink might decay as its 
shortest path length increases. Such fine-grained choices, 
however, require substantive justification drawn from the 
problem at hand. One should also note that path-based 
algorithms are likely to be more complex than either 
of the first two measures. 
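To make the weighted form concrete, the sketch below picks one possible weight: each sink counts in proportion to the number of distinct ancestors of either node that trace back to it. This choice of weight, the function names, and the convention that a node belongs to its own ancestor set are all assumptions made for illustration; the text leaves the function open and other choices may suit a given problem better. Paths are ignored here.

```python
from typing import Dict, Hashable, List, Set

Graph = Dict[Hashable, List[Hashable]]  # document -> documents it cites


def reachable(cites: Graph, start: Hashable) -> Set[Hashable]:
    """The start node plus every document reachable from it along citations."""
    seen, stack = {start}, [start]
    while stack:
        for target in cites.get(stack.pop(), []):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


def weighted_sink_distance(cites: Graph, i: Hashable, j: Hashable) -> float:
    a_i, a_j = reachable(cites, i), reachable(cites, j)
    s_i = {x for x in a_i if not cites.get(x)}  # sinks among i's ancestors
    s_j = {x for x in a_j if not cites.get(x)}

    def weight(s: Hashable) -> int:
        # s-ancestors: ancestors of a node that themselves trace back to sink s
        a_is = {x for x in a_i if s in reachable(cites, x)}
        a_js = {x for x in a_j if s in reachable(cites, x)}
        return len(a_is | a_js)  # hypothetical choice of weight

    shared = sum(weight(s) for s in s_i & s_j)
    total = sum(weight(s) for s in s_i | s_j)
    return 1.0 - shared / total


# Hypothetical example: c cites sink a; d cites c and sink b; e cites c.
cites = {"a": [], "b": [], "c": ["a"], "d": ["c", "b"], "e": ["c"]}
print(weighted_sink_distance(cites, "d", "e"))
```

In this example the unweighted measure gives 1/2, since d and e share one of two sinks, while the weighted sketch gives 1/3 because the shared sink a is carried by more ancestors than the unshared sink b.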

The above distance measures all bear some similar- 
ity to the Jaccard similarity measure, as they involve 
intersections in the numerator and unions in the denom- 
inator ([7]). However, the standard Jaccard similarity 
index only takes into account the neighbors of each node 
and ignores nodes of any further distance. Thus, the 



above distance measures may better capture and weight 
"shared ancestry" than the Jaccard similarity measure. 



IV. APPLICATIONS 

Once a distance measure has been offered, a number 
of interesting research questions become relevant. One 
question of particular interest is whether a given graph 
exhibits detectable clustering or community structure. A 
significant amount of recent scholarship has been devoted 
to the production of community detection algorithms for 
general graphs ([H]). However, as noted earlier, in the 
context of acyclic citation networks, there are a number 
of issues that may impact both the accuracy and the 
time-evolving stability of substantive results returned by 
traditional community detection methods. 

One important issue is dimension frequency - that is, 
some topics may occur much more frequently than others 
in the overall network. For example, suppose that a 
node $z$ primarily concerns dimension $d_1$, but also touches 
upon dimension $d_2$. If subsequent documents more often 
confront dimension $d_2$ than $d_1$, it is possible that $z$ could 
receive more $d_2$-related citations than $d_1$-related citations. As 
a result, traditional community detection methods are 
more likely to cluster document $z$ with $d_2$-related documents 
than with $d_1$-related documents. Though this example 
illustrates the evolutionary nature of documents within 
such citation networks, it is clear that traditional commu- 
nity detection algorithms may produce clusterings that 
differ from a given researcher's goals. 

Instead, one might seek to cluster documents in a man- 
ner consistent with the citation choices of the author at 
the time the document was written. In this case, sink- 
based distance measures as presented above might be a 
good choice for clustering. Take the above node $z$ as an 
example. Suppose node $z$ has three sinks linked to dimension 
$d_1$, but only one sink dealing with dimension $d_2$. 
Even if many more $d_2$ documents cite $z$, they can only 
share one of four sinks at most. By contrast, $d_1$ documents 
can share up to three sinks with $z$. Thus, regardless 
of the number of citations from $d_1$ and $d_2$ documents, 
$z$ can still be closer to $d_1$ documents. Though many 
gradations of this example exist, when confronted with 
unnormalized and high-dimensional citation information, 
sink-based distance measures are likely to be more robust 
to this issue than traditional community detection meth- 
ods. 
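As a worked illustration of this argument, take the distance to be one minus the ratio of shared sinks to the union of sinks, as in Section III. Write u for a hypothetical first-dimension document whose sinks are exactly the three first-dimension sinks of z, and v for a hypothetical second-dimension document that shares at most the single second-dimension sink with z. The labels u and v are introduced here for illustration only and do not appear in the original example.

```latex
\begin{align*}
D_{z,u} &= 1 - \frac{|S_z \cap S_u|}{|S_z \cup S_u|} = 1 - \frac{3}{4} = \frac{1}{4}, \\
D_{z,v} &= 1 - \frac{|S_z \cap S_v|}{|S_z \cup S_v|} \geq 1 - \frac{1}{4} = \frac{3}{4}.
\end{align*}
```

The bound for v holds because the intersection contains at most one sink while the union contains at least the four sinks of z. Neither quantity depends on how many second-dimension documents cite z.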

To test whether meaningful clustering can be derived 
from these sink distance measures, we apply equation (1) 
to two networks below: one a theoretical model generated by 
two-dimensional directed preferential attachment, and the 
other drawn from the citations of the 
early United States Supreme Court. 

A. Comparison on a Random Model 

In Section II.B above, we argue that a number of issues 
can cause problems with existing community and 
clustering algorithms. To test this claim, we have generated 
realizations from a citation model based on two-dimensional 
directed asymmetric preferential attachment 
([12]). 

The model has two types of nodes - red and blue. At 
each model step, a new node is introduced into the network. 
With probability $l_r$, a node will be red, and thus 
the complement $l_b$ is the probability of the node being 
blue. To determine how many citations this node will 
make, we sample a uniform random integer between 1 
and $m$. These citation edges are assigned according to 
the directed preferential attachment model, where red 
nodes have probability $p_{rr}$ of citing red nodes and probability 
$p_{rb}$ of citing blue nodes. Likewise, blue nodes have 
probability $p_{br}$ of citing red nodes and probability $p_{bb}$ of 
citing blue nodes. For initial conditions, there are $n_r$ 
initial red nodes and $n_b$ initial blue nodes. 

In order to demonstrate the problems described above, 
we choose the parameters of the model to emphasize our 
example from Section II.B. The node type rates are given 
by $l_r$ = I ; $l_b$ = I , the maximum number of edges per 
node is given by $m = 3$, the preferential probabilities are 
given by $p_{rr}$ = ^, $p_{br}$ = \, $p_{bb}$ = |, and the initial numbers 
of nodes of each type are given by $n_r = 2$, $n_b = 1$. 
This models a system with two dimensions where each 
node may only have one dimension, and one dimension 
occurs much more frequently than another. Furthermore, 
though one dimension is perfectly homophilic, the other 
attaches to both. Figure 2 below shows an example re- 
alization of this model, where the large squares denote 
sinks. 
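The generative process described above can be sketched in a few lines. This is an illustrative simplification and not the exact model of the cited reference: the function name `simulate`, the use of in-degree plus one as the attachment weight, the rule that a node never cites the same target twice, and the numerical rates used in the example call are all assumptions.

```python
import random
from typing import Dict, List, Tuple


def simulate(steps: int, l_r: float, m: int, p_rr: float, p_br: float,
             n_r: int, n_b: int, seed: int = 0
             ) -> Tuple[Dict[int, str], Dict[int, List[int]]]:
    """Grow a two-colour citation network; returns (colour, citations) per node."""
    rng = random.Random(seed)
    color = {k: ("red" if k < n_r else "blue") for k in range(n_r + n_b)}
    cites: Dict[int, List[int]] = {k: [] for k in color}  # initial nodes are sinks
    in_degree = {k: 0 for k in color}

    for new in range(n_r + n_b, n_r + n_b + steps):
        new_color = "red" if rng.random() < l_r else "blue"
        p_red = p_rr if new_color == "red" else p_br  # chance that a citation targets red
        chosen = set()
        for _ in range(rng.randint(1, m)):  # uniform number of citations in 1..m
            target_color = "red" if rng.random() < p_red else "blue"
            pool = [v for v in color
                    if color[v] == target_color and v not in chosen]
            if not pool:
                continue  # no uncited node of that colour remains
            weights = [in_degree[v] + 1 for v in pool]  # assumed attachment weight
            chosen.add(rng.choices(pool, weights=weights, k=1)[0])
        color[new] = new_color
        cites[new] = sorted(chosen)
        in_degree[new] = 0
        for v in chosen:
            in_degree[v] += 1
    return color, cites


# Hypothetical rates: red is frequent and perfectly homophilic, blue cites both.
color, cites = simulate(steps=50, l_r=0.75, m=3, p_rr=1.0, p_br=0.5,
                        n_r=2, n_b=1, seed=1)
print(sorted(v for v, out in cites.items() if not out))
```

Because every new node only cites nodes that already exist, the result is acyclic by construction, and the initial nodes are the only sinks as long as both colours are present at the start.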




FIG. 2: Realization of Random Model 



To justify our method, we compare our sink-based ap- 
proach with directed edge-betweenness (jE]). First, we 
apply equation (1) to calculate a full distance matrix for 
all pairs of nodes. Using this matrix, we then apply a 
single-linkage hierarchical clustering algorithm to these 
distances ([11]). The resulting dendrogram and its implied 
clustering are shown in Figure 3(a) and Figure 2, 
respectively. Next, we apply the directed edge-betweenness 
algorithm to produce the merge dendrogram in Figure 
3(b). Both Figures 2 and 3 are generated from the same 
underlying network visualized in Figure 2. The number- 
ing on both figures corresponds to the nodes, and the 
letters A through E correspond to the clusters detected 
by the sink method in Figure 3(a). 
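The clustering step can be illustrated without any external library. Cutting a single-linkage dendrogram at a given height yields the connected components of the graph that links every pair of items whose distance is at most that height, so the sketch below uses a union-find structure. The function name `single_linkage_clusters`, the threshold parameter, and the small sink sets in the example are hypothetical and are not taken from the figures.

```python
from itertools import combinations
from typing import Callable, Dict, Hashable, List, Sequence


def single_linkage_clusters(items: Sequence[Hashable],
                            dist: Callable[[Hashable, Hashable], float],
                            threshold: float) -> List[List[Hashable]]:
    """Flat clusters obtained by cutting a single-linkage dendrogram at threshold."""
    parent = {x: x for x in items}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(items, 2):
        if dist(a, b) <= threshold:
            parent[find(a)] = find(b)  # merge the two clusters

    groups: Dict[Hashable, List[Hashable]] = {}
    for x in items:
        groups.setdefault(find(x), []).append(x)
    return sorted(groups.values())


# Hypothetical sink sets for six non-sink nodes of a network with sinks 0, 1, 2.
sinks = {3: {0}, 4: {0}, 5: {1}, 6: {0, 1}, 7: {0, 1}, 8: {0, 1, 2}}


def d(i, j):
    return 1.0 - len(sinks[i] & sinks[j]) / len(sinks[i] | sinks[j])


print(single_linkage_clusters(list(sinks), d, threshold=0.0))
```

With a threshold of zero, nodes are grouped exactly when they trace back to the same set of sinks, which is the kind of grouping described for the sink method in the discussion of Figure 3(a).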

The differences in Figure 3 are striking, though not 
entirely unexpected. The sink method in Figure 3(a) 
identifies five "communities" of nodes at identical branch 
location. From top to bottom, these branches correspond 
to (1) nodes that trace back only to node 2, (2) nodes 
that trace back only to node 1, (3) nodes that trace back 
only to node 0, (4) nodes that trace back to all three 
sinks 0, 1, and 2, and (5) nodes that trace back to both 
nodes 0 and 1. 

Since the edge-betweenness algorithm produces binary 
branching dendrograms like most agglomerative and divisive 
algorithms, Figure 3(b) exhibits a great deal more 
complexity than Figure 3(a). This complexity is sometimes 
warranted; however, it is often the product of ties in the 
agglomerative or divisive decision criteria. Since the sink 
method relies only on hierarchical clustering, it places 
nodes with equal distance at an equal branch position. 
In this case, the sink method identifies clusters that are 
closely related to the underlying network formation pro- 
cess. 



B. Results for the United States Supreme Court 
Citations 

To demonstrate applicability beyond the context of a theoretical 
model, we applied our approach to the case-to-case 
citation network containing the first quarter century 
of decisions of the United States Supreme Court. The 
structure of this network of citations is of interest to a 
wide variety of scholars including not only law professors 
and social scientists, but also members of the physical sci- 
ence community ([H], [5], [S], [H]). While it is possible 
to perform community detection analysis over the total 
body of Supreme Court decisions, we selected a reduced 
window of decisions in order to qualitatively examine the 
results of our algorithm [22]. 

The Court's early citation practices indicate an ab- 
sence of references to its own prior decisions. While the 
court did invoke well-established legal concepts, those 
concepts were often originally developed in alternative 
domains or jurisdictions [23]. At some level, the lack of 
self-reference and corresponding reliance upon external 
sources is not terribly surprising. Namely, there often did 
not exist a set of established Supreme Court precedents 
for the class of disputes which reached the high court. 
Thus, it was necessary for the jurisprudence of the United 
States Supreme Court, seen through the prism of its case-to-case 
citation network, to transition through a loading 
phase. During this loading phase, the largest weakly 
connected component of the graph generally lacked any 
meaningful clustering. However, this sparsely connected 
graph would soon give way, and by the early 1820s, the 
largest connected component displayed detectable structure. 

FIG. 3: Clustering Dendrograms 

FIG. 4: Hierarchical Clustering of Supreme Court Decisions, 
1820 

Despite applying the most naive assumptions and least 
complicated clustering algorithm, our qualitative analy- 
sis reveals that accurate clusterings are produced by this 
scheme. By applying our sink clustering method, we obtain 
a dendrogram of the network's largest weakly connected 
component, shown in Figure 4. The coloring in 
both Figure 4 and Figure 5 corresponds to two large 
clusters in the network. Edges are colored blue or red 
if the head or tail are of the respective group. Edges 
colored green span across these groups. 

FIG. 5: Largest Weakly Connected Component of the Network 
of Supreme Court Decisions, 1820 

Both of these colorized clusters engage questions related 
to admiralty. While not a major focus of the 
docket of the modern court, the early court elaborated a 
number of important legal concepts through the lens of 
these admiralty decisions. However, despite their general 
substantive relatedness, these two clusters of cases engage 
different substantive sub-questions, and are thus appro- 
priately divided into separate clusters. For example, the 
red group of cases engages questions of presidential power 
and the laws of war, as well as general interpretations 
of the Prize Acts of 1812. Meanwhile, the blue cluster 

V. CONCLUSION 

We present a novel conception of distance for the class 
of dynamic citation networks that has trivial implementation 
and runtime. We successfully apply our sink approach 
to not only a theoretical model but also the 
citation network of the first quarter-century of United 
States Supreme Court decisions. Our method obtains substantively 
meaningful clustering and is less susceptible to 
some issues that arise within a high-dimensional citation 
space. 

Although the substantive application presented here 
focuses on the decisions of the U.S. Supreme Court, 
the applicability of this method is likely not limited 
to judicial citations. For instance, one could imagine 
tracing the spread of innovation using sink clustering of 
the patent citations, or the spread of ideas in an analysis 
of academic articles. In future work, we hope to apply 
this method to such domains, including dimensional 
data where available. 